Match viz dataframe column case to form_data fields for Snowflake, Oracle and Redshift #5487

villebro · 2018-07-25T21:09:04Z

This might be slightly hacky, but I feel SQLAlchemy gives engines so much latitude that Superset needs to be more forgiving when the column case in a query result differs from the case in the datasource/form metadata. A brief summary of what this PR does:

Adjust dataframe column case by performing a case-sensitive comparison of all columns in the central form_data fields (metrics, groupby) with the column names in the dataframe:

Leave all matches untouched.
Rename any other dataframe column to the form_data name that it matches case-insensitively.

Examples:

form_data: __timestamp, dataframe: __timestamp -> Do nothing.
form_data: __timestamp, dataframe: __TIMESTAMP -> Rename dataframe column name to __timestamp.
form_data: __timeSTAMP, dataframe: __TIMESTAMP -> Rename dataframe column name to __timeSTAMP.
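
The three cases above can be sketched in a few lines of Python. This is only an illustration of the rule, not the code in this PR: `build_rename_map` and its arguments are hypothetical names, and it works on plain lists of column names so it runs without pandas. It also assumes that when form_data holds several spellings of the same name, the first one wins.

```python
def build_rename_map(df_columns, form_data_names):
    """Map dataframe column names to their form_data spelling (hypothetical helper)."""
    exact = set(form_data_names)
    # Index form_data names by lowercase; the first spelling seen wins (assumption).
    by_lower = {}
    for name in form_data_names:
        by_lower.setdefault(name.lower(), name)

    rename = {}
    for col in df_columns:
        if col in exact:
            continue  # exact match: leave untouched
        target = by_lower.get(col.lower())
        if target is not None:
            rename[col] = target  # case-insensitive match: adopt form_data spelling
    return rename


print(build_rename_map(["__timestamp"], ["__timestamp"]))  # {}
print(build_rename_map(["__TIMESTAMP"], ["__timeSTAMP"]))  # {'__TIMESTAMP': '__timeSTAMP'}
```

A mapping like this is the kind of dict that pandas accepts in `DataFrame.rename(columns=...)`.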

The dataframe is adjusted prior to caching in BaseViz.get_df_payload(), with the logic for adjustments located in db_engine_specs, which is controlled by the consistent_case_sensitivity attribute. To minimize risk of collisions, dedup has been changed to be able to perform deduping both in a case-sensitive and case-insensitive manner. Default handling would now be case-sensitive, as before, but for affected engines (Snowflake, Oracle, Redshift) handling will be case-insensitive. Example from test case (note the last Bar which is seen as a duplicate despite different case):

>>> print(','.join(dedup(['foo', 'bar', 'bar', 'bar', 'Bar'], case_sensitive=False)))
foo,bar,bar__1,bar__2,Bar__3
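
For readers who want to see how such a flag can produce the output above, here is a minimal sketch. It is written to be consistent with the test case and is not necessarily the implementation in this PR; the name `dedup_sketch` and the `suffix` parameter are assumptions made for illustration.

```python
def dedup_sketch(names, suffix="__", case_sensitive=True):
    """Append a numeric suffix to repeated names (illustrative, hypothetical helper)."""
    result = []
    seen = {}  # comparison key -> number of duplicates seen so far
    for name in names:
        # Compare on the lowercased name when case should be ignored.
        key = name if case_sensitive else name.lower()
        if key in seen:
            seen[key] += 1
            name = name + suffix + str(seen[key])
        else:
            seen[key] = 0
        result.append(name)
    return result


print(",".join(dedup_sketch(["foo", "bar", "bar", "bar", "Bar"], case_sensitive=False)))
# foo,bar,bar__1,bar__2,Bar__3
print(",".join(dedup_sketch(["foo", "bar", "bar", "bar", "Bar"])))
# foo,bar,bar__1,bar__2,Bar
```

Only the comparison key is lowercased; the original spelling is kept in the output, which is why the last entry comes out as Bar__3 rather than bar__3.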

Other changes:

Add schema URI insertion for Snowflake db_engine_spec
Fix what seemed to be a small bug in how handle_nulls() was implemented
Remove a redundant full_table_name row in db_engine_spec

Feel free to try it out, and don't be shy about giving feedback/critique. As stated above, this might be borderline hacky, but it seems to work fairly well and might make it easier to add new engines later on. This fix should also be backwards compatible (except for the handle_nulls() part, which was recently added), at least with 0.26 and 0.25.

codecov-io · 2018-07-25T21:46:34Z

Codecov Report

Merging #5487 into master will decrease coverage by 0.03%.
The diff coverage is 51.11%.

@@            Coverage Diff             @@
##           master    #5487      +/-   ##
==========================================
- Coverage   63.12%   63.08%   -0.04%     
==========================================
  Files         349      349              
  Lines       22167    22203      +36     
  Branches     2462     2462              
==========================================
+ Hits        13992    14006      +14     
- Misses       8161     8183      +22     
  Partials       14       14

Impacted Files	Coverage Δ
superset/dataframe.py	`94.59% <100%> (+0.09%)`	⬆️
superset/viz.py	`77.47% <100%> (+0.04%)`	⬆️
superset/db_engine_specs.py	`54.19% <31.25%> (-1.21%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9d95c4c...c13415d.

mistercrunch · 2018-07-26T03:56:12Z

superset/viz.py

@@ -467,6 +468,36 @@ def get_data(self, df):
    def json_data(self):
        return json.dumps(self.data)

+    def fix_df_column_case(self, df):


This doesn't really belong in viz.py, I'd much rather have this in db_engine_spec.py and only execute it for these weird databases. Maybe it's a method in BaseEngineSpec that applies conditionally based on a class attribute.

Good idea, I'll move it over.
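
The suggestion above can be pictured roughly as follows. This is a sketch under stated assumptions, not the code that was merged: `BaseSpec`, `CaseFoldingSpec` and `align_column_case` are hypothetical stand-ins, the flag name `consistent_case_sensitivity` is taken from the PR description (a later commit renames the flag, so the final name may differ), and its semantics (True means no adjustment is needed) are assumed.

```python
class BaseSpec:
    """Stand-in for a base engine spec (hypothetical, not Superset's class)."""

    # Flag name taken from the PR description; assumed to mean that the
    # engine returns column names in the same case as the query.
    consistent_case_sensitivity = True

    @classmethod
    def align_column_case(cls, columns, form_data_names):
        """Return column names, re-cased to match form_data when needed."""
        if cls.consistent_case_sensitivity:
            return list(columns)  # well-behaved engine: no adjustment
        exact = set(form_data_names)
        by_lower = {name.lower(): name for name in form_data_names}
        return [
            col if col in exact else by_lower.get(col.lower(), col)
            for col in columns
        ]


class CaseFoldingSpec(BaseSpec):
    """Stand-in for an affected engine such as Snowflake, Oracle or Redshift."""

    consistent_case_sensitivity = False


print(BaseSpec.align_column_case(["__TIMESTAMP"], ["__timestamp"]))
# ['__TIMESTAMP']
print(CaseFoldingSpec.align_column_case(["__TIMESTAMP"], ["__timestamp"]))
# ['__timestamp']
```

Keeping the switch on the spec class means the calling code does not need to know which engines are affected; it only asks the spec to align the names.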

villebro · 2018-07-26T13:13:18Z

This should now be ready for a new round of review and testing. I've updated the description to reflect the current state of the PR.

@mistercrunch Can you take a new look at this?
@minh5 Do you have the opportunity to test this on Redshift?
@mmuru Could you test this against your Snowflake datasources?

Again, all comments more than welcome.

mistercrunch · 2018-07-26T16:59:57Z

LGTM would merge. Holding a bit for confirmation from people tagged here.

mmuru · 2018-07-26T19:04:30Z

@villebro: I verified your PR using your branch and it worked against Snowflake data source.

villebro · 2018-07-26T19:22:05Z

Thanks @mmuru for confirming!

villebro · 2018-07-27T06:42:48Z

@mistercrunch This has now been confirmed on both Snowflake and Redshift. I ended up making a small refactor after your review (commit "Get metrics names from utils") and rebasing, but apart from that the main functionality is unchanged. I'm confident that this is now mergeworthy. It has been confirmed to fix #5489. Thanks @mmuru and @keeyong for testing!

minh5 · 2018-07-27T20:18:22Z

It appears to work for me. The UI gets a little weird when I'm using a Superset that was not installed from pip (this includes installing from source). But I have a series of queries using Redshift, and the server doesn't return the SUM(column_name) errors that I usually experience.

villebro · 2018-07-27T21:45:58Z

Thanks for testing, @minh5. Did you go through the whole npm run dev etc. routine as described in https://github.com/apache/incubator-superset/blob/master/CONTRIBUTING.md#setting-up-the-node--npm-javascript-environment ?

mistercrunch · 2018-07-27T23:59:45Z

FYI there are some unrelated bugs on master at the moment.

villebro · 2018-07-28T10:27:28Z

This also fixes #5353, aka the handle_nulls() bug.

mistercrunch · 2018-07-31T06:47:41Z

Would merge, please resolve conflict

villebro · 2018-07-31T07:27:53Z

@mistercrunch FYI rebased and tested locally to work.

villebro · 2018-08-02T18:39:54Z

@mistercrunch Rebased and ready to merge again. py36-sqlite CI failure seems to be a technicality:

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

The build has been terminated

villebro · 2018-08-03T04:53:35Z

Rebased to kickstart CI, now clean bill of health. Ready for merging @mistercrunch

…acle and Redshift (apache#5487)
* Add function to fix dataframe column case
* Fix broken handle_nulls method
* Add case sensitivity option to dedup
* Refactor function definition and call location
* Remove added blank line
* Move df column rename logit to db_engine_spec
* Remove redundant variable
* Update comments in db_engine_specs
* Tie df adjustment to db_engine_spec class attribute
* Fix dedup error
* Linting
* Check for db_engine_spec attribute prior to adjustment
* Rename case sensitivity flag
* Linting
* Remove function that was moved to db_engine_specs
* Get metrics names from utils
* Remove double import and rename dedup variable

machinoAI · 2018-12-10T10:05:46Z

I am reading data from Redshift and trying to create a chart in Superset grouped by month, but I am not able to because there is no grouping option. What should I do?

villebro · 2018-12-10T13:20:07Z

@adderRavi can you elaborate on what you are trying to do? It sounds like you want to use a month time grain. Also, if you are having trouble with Redshift, I would appreciate it if you could try #5827, as it fixes all the problems I have been able to identify with Redshift and a few other SQLAlchemy engines.

villebro mentioned this pull request Jul 25, 2018

Force lowercase column names for Snowflake and Oracle #4994

Merged

mistercrunch reviewed Jul 26, 2018

villebro mentioned this pull request Jul 26, 2018

Implement schema uri insertion for Snowflake #5474

Closed

villebro changed the title ~~Match viz dataframe column case to form_data fields~~ Match viz dataframe column case to form_data fields for Snowflake, Oracle and Redshift Jul 26, 2018

villebro mentioned this pull request Jul 26, 2018

SUM doesn't work somehow in Redshift table #5489

Closed

3 tasks

villebro mentioned this pull request Jul 28, 2018

Columns with NULL data throw "Unexpected Error" in Table View Chart #5353

Closed

3 tasks

villebro added 11 commits August 2, 2018 21:48

Add function to fix dataframe column case

4635f57

Fix broken handle_nulls method

4cc3f5a

Add case sensitivity option to dedup

6c9b63a

Refactor function definition and call location

25b7b1c

Remove added blank line

41b1381

Move df column rename logit to db_engine_spec

0ccc8a8

Remove redundant variable

3feacf2

Update comments in db_engine_specs

c9faf34

Tie df adjustment to db_engine_spec class attribute

a16122c

Fix dedup error

268754d

Linting

2819d9c

villebro added 6 commits August 2, 2018 21:48

Check for db_engine_spec attribute prior to adjustment

3b2ddb6

Rename case sensitivity flag

cc55738

Linting

8af8940

Remove function that was moved to db_engine_specs

ec042b9

Get metrics names from utils

31fbdf2

Remove double import and rename dedup variable

c13415d

mistercrunch merged commit e1f4db8 into apache:master Aug 3, 2018

villebro deleted the preprocess_df branch December 10, 2018 13:15

mistercrunch added the 🏷️ bot and 🚢 0.28.0 labels Feb 27, 2024
